Contents

1 Install required packages

Install bigWig and latticeExtra package

install.packages("devtools", quiet = TRUE)
## also installing the dependencies 'Rcpp', 'utf8', 'askpass', 'credentials', 'openssl', 'sys', 'zip', 'gitcreds', 'httr2', 'ini', 'httpuv', 'xtable', 'sourcetools', 'later', 'promises', 'fansi', 'systemfonts', 'textshaping', 'pillar', 'pkgconfig', 'diffobj', 'rematch2', 'clipr', 'crayon', 'curl', 'gert', 'gh', 'purrr', 'rprojroot', 'rstudioapi', 'whisker', 'shiny', 'callr', 'processx', 'downlit', 'httr', 'ragg', 'tibble', 'xml2', 'htmlwidgets', 'prettyunits', 'xopen', 'brew', 'commonmark', 'cpp11', 'brio', 'praise', 'ps', 'waldo', 'usethis', 'desc', 'miniUI', 'pkgbuild', 'pkgdown', 'pkgload', 'profvis', 'rcmdcheck', 'remotes', 'roxygen2', 'rversions', 'sessioninfo', 'testthat', 'urlchecker', 'withr'
library(devtools)
## Loading required package: usethis
devtools::install_github('andrelmartins/bigWig',
              subdir='bigWig')
## Downloading GitHub repo andrelmartins/bigWig@HEAD
## ── R CMD build ─────────────────────────────────────────────────────────────────
## * checking for file ‘/private/tmp/RtmpTSeCGC/remotese4d6dcdf5c9/andrelmartins-bigWig-beac5a7/bigWig/DESCRIPTION’ ... OK
## * preparing ‘bigWig’:
## * checking DESCRIPTION meta-information ... OK
## * cleaning src
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * building ‘bigWig_0.2-9.tar.gz’
library(bigWig)

install.packages("latticeExtra", quiet = TRUE)
## also installing the dependencies 'deldir', 'RcppEigen', 'png', 'jpeg', 'RColorBrewer', 'interp'
install.packages("DESeq2", quiet = TRUE)
## Warning: package 'DESeq2' is not available for this version of R
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

Install bedtools

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install bedtools

2 Jan 2nd

2.1 Summary of Exhaustive MEME/STREM/MAST analysis:

We begin by using a 101bp window centered around the peak summit and employ MEME/STREME software to identify potential GATA3 binding sites. MEME (classic) identified six GATA3 motifs within the GATA3 ChIP peak set. Subsequent MEME analysis revealed no additional GATA3-matched motifs in peaks without the initial 6 motifs discovered by MEME.
On the other hand, STREME detected two GATA3 motifs in peaks where the 6 MEME-found motifs were absent.

We then expanded the search window from 101bp to 161bp and used MEME and STREME to identify potential binding sites within the full 161bp window and the spanned region only (with the central 81bp swapped to N).

With STREME, we discovered two motifs within the spanned region. These motifs resembled the previously found motifs within the 101bp window. As a result, we concluded that the 101bp window was insufficient for capturing all potential GATA3 binding sites.

Subsequently, we employed MAST to locate peaks containing the previous 8 motifs within the 161bp window. For peaks that didn’t overlap with any motif coordinates, we increased their width from 161bp to 181bp and conducted the same MEME/STREME analysis. However, this time, no GATA3-like motifs were found.

In my GATA3 peak set, GATA3 appears to preferentially bind to the central 161bp window. Despite this, MEME/STREME failed to identify GATA3 binding elements in 38.5% of the peaks.

2.2 Motif Occurence in GATA peaks

stack bar plot
a bar plot showing peaks contain GATA-like motifs, and other peaks

wc -l  *round*.bed
# 11427 mast_GATA3_PSWM_in_peaks_round1.bed
#  10460 mast_GATA3_PSWM_in_peaks_round2.bed
#   5202 mast_GATA3_PSWM_in_peaks_round3.bed
#   5665 mast_GATA3_PSWM_in_peaks_round4.bed
#   3564 mast_GATA3_PSWM_in_peaks_round5.bed
#   4510 mast_GATA3_PSWM_in_peaks_round6.bed
#   8733 mast_GATA3_PSWM_in_peaks_round7.bed
#   4452 mast_GATA3_PSWM_in_peaks_round8.bed
#   6002 mast_GATA3_PSWM_in_peaks_round9.bed --161bp window
#  60015 total

   
counts=$(wc -l without_motifs_123456_78_161bp_mast.bed | awk -F"without" '{print $1}')  #37308
label1='others'
echo $counts "" $label1 >> bar.csv
total_counts=$(wc -l /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/GATA_ChIP_summit_100window.bed | awk -F"/" '{print $1}')  #96868
counts2=$(echo "$total_counts - $counts" | bc)
label2='GATA3-like'
echo $counts2 "" $label2 >> bar.csv
#module load R/4.1.2 
#R 

library("lattice") 
bar=read.csv('bar.csv', sep = "", header=F)
colnames(bar)=c("peak_numbers", "enriched_motifs")
bar$dum.x="peak number"
bar$enriched_motif<- factor(bar$enriched_motif, levels = c("GATA3-like","others"))

pdf('240103_peak_number_with_or_without_GATA3_motif_variant.pdf', width=5,height=6)
my.settings <- list(
  #superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
  superpose.polygon=list(col=c("red","light grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(peak_numbers ~ dum.x,         
         data = bar,
         groups = enriched_motif,
         stack = TRUE,
         auto.key=list(space="right"),
         #scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "number of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()

png('240103_peak_number_with_or_without_GATA3_motif_variant.png')
my.settings <- list(
  #superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
  superpose.polygon=list(col=c("red","light grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(peak_numbers ~ dum.x,         
         data = bar,
         groups = enriched_motif,
         stack = TRUE,
         auto.key=list(space="right"),
         #scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "number of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()
percent peaks with or without GATA3 motif variants

Figure 1: percent peaks with or without GATA3 motif variants

bar plot (group by peak intensity)
The fraction of peaks with MEME-found-motif 123456, as well as STREME-found motif78, in each quantile:

# quantile file is 1bp summit file, expanded to 161window
#module load R/4.1.2 
#R
library(bigWig)

for (chip.peak in Sys.glob(file.path("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_Intensity/GATA_Deseq2/quantile*summits.bed")))  {
    print(chip.peak)
    quantile.name = strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], '_summits.bed')[[1]][1]
    print(quantile.name)
    chip.peaks=read.table(chip.peak, header=FALSE)
    chip.peak.161win=center.bed(chip.peaks, upstreamWindow=80, downstreamWindow=80)
    write.table(chip.peak.161win,file= paste0(quantile.name, '_summits_161window.bed'), quote=F,sep="\t",col.names=F,row.names=F)
    }
module load bedtools
dir="/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_Intensity/GATA_Deseq2/"

# quantile peaks with individual motif
for i in ${dir}*quantile*161window.bed
do
  nm=$(echo $i | awk -F $dir '{print $2}' | awk -F "_summits_161window.bed" '{print $1}')
  intersectBed -wa -a $i -b ../without_motifs_123456_78_161bp_mast.bed | sort -k1,1 -k2,2n | uniq >${nm}_without_motifs_123456_78_161bp_mast.bed
done


touch quantile.240103.sum.161window.txt
for i in *quantile*_without_motifs*.bed
do
 nm=$(echo $i | awk -F"_without_motifs_123456_78_161bp_mast.bed" '{print $1}')
 p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
 totalp=$(wc -l ${dir}${nm}_summits.bed | awk '{OFS="\t";} {print $1}')
 withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
 withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
 echo $nm "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
#module load R/4.1.2 
#R 

library("lattice") 
library("reshape2")


df_sum=read.table('quantile.240103.sum.161window.txt', sep = "", header=FALSE)
colnames(df_sum)=c("quantile","without_Motif","with_Motif")
df_sum_long <- df_sum

df_sum_long <- melt(df_sum_long, id = "quantile")
df_sum_long$variable<- factor(df_sum_long$variable, levels = c("with_Motif", "without_Motif"))

pdf('240103_percent_peaks_withorwithout_GATA3_motif_variant.pdf',width=8,height=5)
my.settings <- list(
  superpose.polygon=list(col=c("black", "grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(value ~ quantile,         
         data = df_sum_long,
         groups = variable,
         stack = TRUE,
         auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "fraction of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()

png('240103_percent_peaks_withorwithout_GATA3_motif_variant.png')
my.settings <- list(
  superpose.polygon=list(col=c("black", "grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(value ~ quantile,         
         data = df_sum_long,
         groups = variable,
         stack = TRUE,
         auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "fraction of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()
percent peaks with or without GATA3 motif (variant 1,2,3,4,5,6,7,8,9):
percent peaks with or without GATA3 motif variants

Figure 2: percent peaks with or without GATA3 motif variants

For peaks in the top quantile intensity (top 5% highest intensity), we have ~16% peaks MEME/STREME are not able to find GATA3 binding elements.

2.3 ENCODE DHS in MCF7 cells

2.3.1 Retrive the ENCODE DNAse-seq data for MCF7 cells (Refer to GSE29692)

Stam_MCF-7_1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736581
Stam_MCF-7_2: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736588

IMPORTANT
1) Notice that these data are using hg19 genome, we need to use UCSC liftover to convert data to hg38.
2) While processing the negative control (DHS regions), we want to remove the 8 motifs first, use same window (161bp) and same mast p value.

# 173277 MCF7-DS12619.peaks.fdr0.01.hg19.bed
head -5 MCF7-DS12619.peaks.fdr0.01.hg19.bed
#chr1   10180   10330   .   0   .   12  5.68855 -1  -1
#chr1   16160   16310   .   0   .   11  4.49825 -1  -1
#chr1   237660  237810  .   0   .   41  33.423  -1  -1
#chr1   521440  521590  .   0   .   66  76.4085 -1  -1
#chr1   565560  565710  .   0   .   91  20.8239 -1  -1

#126717 MCF7-DS9445.peaks.fdr0.01.hg19.bed
head -5 MCF7-DS9445.peaks.fdr0.01.hg19.bed
#chr1   237720  237870  .   0   .   11  7.10762 -1  -1
#chr1   521440  521590  .   0   .   16  15.6231 -1  -1
#chr1   565280  565430  .   0   .   41  11.9262 -1  -1
#chr1   565540  565690  .   0   .   54  39.9645 -1  -1
#chr1   565860  566010  .   0   .   95  39.9645 -1  -1

Process the file, keep only the coordinates info. (liftover has format requirements).

awk '{OFS="\t"} {print $1, $2, $3}' MCF7-DS12619.peaks.fdr0.01.hg19.bed > MCF7_hg19_Encode_DHS_Rep2.bed
awk '{OFS="\t"} {print $1, $2, $3}' MCF7-DS9445.peaks.fdr0.01.hg19.bed > MCF7_hg19_Encode_DHS_Rep1.bed
head -5 MCF7_hg19_Encode_DHS_Rep2.bed #173277
#chr1   10180   10330
#chr1   16160   16310
#chr1   237660  237810
#chr1   521440  521590
#chr1   565560  565710

head -5 MCF7_hg19_Encode_DHS_Rep1.bed #126717
#chr1   237720  237870
#chr1   521440  521590
#chr1   565280  565430
#chr1   565540  565690
#chr1   565860  566010

Then we go to the UCSC liftover online tool to convert the hg19 regions to hg38 regions.
https://genome.ucsc.edu/cgi-bin/hgLiftOver

head -5 MCF7_hg38_Encode_DHS_Rep2.bed # 173226
#chr1   10180   10330   chr1:10181-10330    1
#chr1   16160   16310   chr1:16161-16310    1
#chr1   267909  268059  chr1:237661-237810  1
#chr1   586060  586210  chr1:521441-521590  1
#chr1   630180  630330  chr1:565561-565710  1

head -5 MCF7_hg38_Encode_DHS_Rep1.bed # 126690
#chr1   267969  268119  chr1:237721-237870  1
#chr1   586060  586210  chr1:521441-521590  1
#chr1   629900  630050  chr1:565281-565430  1
#chr1   630160  630310  chr1:565541-565690  1
#chr1   630480  630630  chr1:565861-566010  1

Notice that, during liftover process, there are chance to lose some regions.

Remove the original hg19 coordinates info, and keep only the hg38 coordinates.

awk '{OFS="\t"} {print $1, $2, $3}' MCF7_hg38_Encode_DHS_Rep2.bed > MCF7_hg38_Encode_DHS_Rep2_peak.bed
awk '{OFS="\t"} {print $1, $2, $3}' MCF7_hg38_Encode_DHS_Rep1.bed > MCF7_hg38_Encode_DHS_Rep1_peak.bed

wc -l MCF7_hg38_Encode_DHS_Rep2.bed #173226
wc -l MCF7_hg38_Encode_DHS_Rep2_peak.bed #173226
head -5 MCF7_hg38_Encode_DHS_Rep2_peak.bed
#chr1   10180   10330
#chr1   16160   16310
#chr1   267909  268059
#chr1   586060  586210
#chr1   630180  630330

wc -l MCF7_hg38_Encode_DHS_Rep1.bed #126690
wc -l MCF7_hg38_Encode_DHS_Rep1_peak.bed #126690
head -5 MCF7_hg38_Encode_DHS_Rep1_peak.bed
#chr1   267969  268119
#chr1   586060  586210
#chr1   629900  630050
#chr1   630160  630310
#chr1   630480  630630

2.3.2 remove regions overlap with GATA3 binding sites

make sure the DHS regions are 161bp window

library(bigWig)
peak.region.summit1=center.bed(read.table('MCF7_hg38_Encode_DHS_Rep1_peak.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
peak.region.summit2=center.bed(read.table('MCF7_hg38_Encode_DHS_Rep2_peak.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)

peak.region.161bp.1=center.bed(peak.region.summit1, upstreamWindow = 80, downstreamWindow = 80)
peak.region.161bp.2=center.bed(peak.region.summit2, upstreamWindow = 80, downstreamWindow = 80)
head(peak.region.161bp.1)
##     V1     V2     V3
## 1 chr1 267964 268125
## 2 chr1 586055 586216
## 3 chr1 629895 630056
## 4 chr1 630155 630316
## 5 chr1 630475 630636
## 6 chr1 631375 631536
head(peak.region.161bp.2)
##     V1     V2     V3
## 1 chr1  10175  10336
## 2 chr1  16155  16316
## 3 chr1 267904 268065
## 4 chr1 586055 586216
## 5 chr1 630175 630336
## 6 chr1 630475 630636
write.table(peak.region.161bp.1,file= 'MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(peak.region.161bp.2,file= 'MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed', quote=F,sep="\t",col.names=F,row.names=F)

remove GATA3 motifs with MAST

The DHS regions contain all regulatory regions that are sensitive to DNAse. To make a independent negative control, we use MAST to remove regulatory regions that overlap with GATA3-like motif 12345678 (use same p-value stringency).

First convert regulatory region file from .bed to .fasta.

module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa

fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -fo MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta
fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -fo MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta
## chr1 267964  268125
## chr1 586055  586216
## chr1 629895  630056
## >chr1:267964-268125
## GGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTG
## >chr1:586055-586216
## TTTCCCACATTATTCAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCGGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGGGTGGGGGC
## >chr1:629895-630056
## TAACCAATACCACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCTGACATCCGGCCTGCTCCTTCTCACATGACAAAAACTAGCCCCCATC
## chr1 10175   10336
## chr1 16155   16316
## chr1 267904  268065
## >chr1:10175-10336
## aacctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaaccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaaccccaaccccaaccccaaccccaaccctaacccctaa
## >chr1:16155-16316
## CAGATACTCCCTGCTTCCTCTCTAGCCCCCACCCTGCAGAGCTGGACCCCTGAGCTAGCCATGCTCTGACAGTCTCAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGACACAGCAGTGAAGCTGAAA
## >chr1:267904-268065
## CCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGC

We have 8 motifs that found by MEME/STREME software.

ls individual_meme/
## AGATAAM_streme.txt
## GATAmotif1_meme.txt
## GATAmotif2_meme.txt
## GATAmotif3_meme.txt
## GATAmotif4_meme.txt
## GATAmotif5_meme.txt
## GATAmotif6_meme.txt
## TGATAA_streme.txt

MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/

#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.bed #5836
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.bed > MCF7DHS_Rep1_161bp_without_motifs_1.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_1.bed #120854

#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_1.bed -fo MCF7DHS_Rep1_161bp_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep1_161bp_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.bed #4978
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.bed > MCF7DHS_Rep1_161bp_without_motifs_12.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_12.bed #115876

#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_12.bed -fo MCF7DHS_Rep1_161bp_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep1_161bp_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.bed #2263
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.bed > MCF7DHS_Rep1_161bp_without_motifs_123.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123.bed #113613

#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep1_161bp_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.bed #3237
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.bed > MCF7DHS_Rep1_161bp_without_motifs_1234.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_1234.bed #110376

#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_1234.bed -fo MCF7DHS_Rep1_161bp_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep1_161bp_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.bed #2629
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.bed > MCF7DHS_Rep1_161bp_without_motifs_12345.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_12345.bed #107747


#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_12345.bed -fo MCF7DHS_Rep1_161bp_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep1_161bp_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.bed #2800
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.bed > MCF7DHS_Rep1_161bp_without_motifs_123456.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456.bed #104947

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123456.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep1_161bp_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.bed #8176
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.bed > MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed #96771


#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep1_161bp_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.txt 
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.bed #4383
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.bed > MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed  #92388

MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/

#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.bed #5249
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.bed > MCF7DHS_Rep2_161bp_without_motifs_1.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_1.bed #167977

#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_1.bed -fo MCF7DHS_Rep2_161bp_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep2_161bp_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.bed #4249
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.bed > MCF7DHS_Rep2_161bp_without_motifs_12.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_12.bed #163728

#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_12.bed -fo MCF7DHS_Rep2_161bp_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep2_161bp_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.bed #1864
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.bed > MCF7DHS_Rep2_161bp_without_motifs_123.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123.bed #161864

#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep2_161bp_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.bed #3075
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.bed > MCF7DHS_Rep2_161bp_without_motifs_1234.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_1234.bed #158789

#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_1234.bed -fo MCF7DHS_Rep2_161bp_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep2_161bp_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.bed #2510
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.bed > MCF7DHS_Rep2_161bp_without_motifs_12345.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_12345.bed #156279


#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_12345.bed -fo MCF7DHS_Rep2_161bp_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep2_161bp_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.bed #2689
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.bed > MCF7DHS_Rep2_161bp_without_motifs_123456.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456.bed #153590

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123456.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep2_161bp_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.bed #8783
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.bed > MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed #144807


#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep2_161bp_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.txt 
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.bed #4908
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.bed > MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed  #139899

We have discovered 8 motifs using MEME/STREME software. Previously, we concatenated them into a single motif matrix database file (“combined_output_meme.txt”). Now, we aim to perform a coherence check by using MAST to scan this concatenated file against the remaining regions.

#rep1
mast -hit_list -best combined_output_meme.txt MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round9.txt 
#rep2
mast -hit_list -best combined_output_meme.txt MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round9.txt 

Mast the concatenated file against the remaining regions do not give overlapped regions.

2.4 Independent control – remove GATA3 peak regions from the DHS regions

To make an independent control, we can remove all overlapped GATA3 ChIP-seq peak regions from the DHS regions. Then we will MAST against the remaining DHS regions, and see what is the random odds we could get GATA3 bindin region.

module load bedtools
sizes=/home/FCAM/ssun/Genome/hg38.chrom.sizes
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/

slopBed -b 80 -i ${dir}GATA_ChIP_summits_final.bed -g $sizes  | sort -k1,1 -k2,2n > GATA_ChIP_summit_161window.bed

#remove overlapped GATA3 peak regions from DHS regions
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -b GATA_ChIP_summit_161window.bed > MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -b GATA_ChIP_summit_161window.bed > MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed
wc -l MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed #126690
wc -l MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed #84362
wc -l MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed #173226
wc -l MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed #131896

convert to fasta.

module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa

fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed -fo MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta
fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed -fo MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta
## chr1 267964  268125
## chr1 586055  586216
## chr1 629895  630056
## >chr1:267964-268125
## GGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTG
## >chr1:586055-586216
## TTTCCCACATTATTCAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCGGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGGGTGGGGGC
## >chr1:629895-630056
## TAACCAATACCACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCTGACATCCGGCCTGCTCCTTCTCACATGACAAAAACTAGCCCCCATC
## chr1 10175   10336
## chr1 16155   16316
## chr1 267904  268065
## >chr1:10175-10336
## aacctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaaccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaaccccaaccccaaccccaaccccaaccctaacccctaa
## >chr1:16155-16316
## CAGATACTCCCTGCTTCCTCTCTAGCCCCCACCCTGCAGAGCTGGACCCCTGAGCTAGCCATGCTCTGACAGTCTCAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGACACAGCAGTGAAGCTGAAA
## >chr1:267904-268065
## CCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGC

MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/

#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.bed #1739
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed #82623

#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.bed #1517
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed #81106

#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.bed #802
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed #80304

#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.bed #1479
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed #78825

#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.bed #1219
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed #77606


#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.bed #1239
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed #76367

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.bed #4663
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed #71704


#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.txt 
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.bed #2585
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed  #69119

MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/

#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.bed #1976
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed #129920

#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.bed #1556
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed #128364

#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.bed #818
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed #127546

#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.bed #1688
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed #125858

#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.bed #1385
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed #124473


#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.bed #1443
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed #123030

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.bed #5859
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed #117171


#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.txt 
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.bed #3397
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed  #113774

2.4.1 bar plot

bar plot (group by peak intensity)
The fraction of peaks with MEME-found-motif 123456, as well as STREME-found motif78, in each quantile:

cp /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/bar_plot/quantile.240103.sum.161window.txt .

We will add the two DHS file.

#MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed
#MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed

for i in MCF7DHS*_161bp_without_motifs_123456_78.bed
do
 nm=$(echo $i | awk -F"_161bp_without_motifs_123456_78.bed" '{print $1}')
 rep=$(echo $i | awk -F"_161bp_without_motifs_123456_78.bed" '{print $1}' | awk -F"MCF7DHS_" '{print $2}')
 p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
 totalp=$(wc -l MCF7_hg38_Encode_DHS_${rep}_161bp_peak.bed | awk '{OFS="\t";} {print $1}')
 withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
 withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
 echo $nm "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
#MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed
#MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed

for i in MCF7*_161bp_noGATA_without_motifs_123456_78.bed
do
 nm=$(echo $i | awk -F"_161bp_noGATA_without_motifs_123456_78.bed" '{print $1}')
 rep=$(echo $i | awk -F"_161bp_noGATA_without_motifs_123456_78.bed" '{print $1}' | awk -F"MCF7DHS_" '{print $2}')
 p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
 totalp=$(wc -l MCF7_hg38_Encode_DHS_${rep}_161bp_noGATA_peak.bed | awk '{OFS="\t";} {print $1}')
 withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
 withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
 echo "${nm}_noGATA" "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
module load R/4.1.2 
R 

library("lattice") 
library("reshape2")


df_sum=read.table('quantile.240103.sum.161window.txt', sep = "", header=FALSE)
colnames(df_sum)=c("quantile","without_Motif","with_Motif")
df_sum_long <- df_sum

df_sum_long <- melt(df_sum_long, id = "quantile")
df_sum_long$variable<- factor(df_sum_long$variable, levels = c("with_Motif", "without_Motif"))

pdf('240105_percent_peaks_withorwithout_GATA3_motif_variant.pdf',width=8,height=5)
my.settings <- list(
  superpose.polygon=list(col=c("black", "grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(value ~ quantile,         
         data = df_sum_long,
         groups = variable,
         stack = TRUE,
         auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "fraction of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()
percent peaks with or without GATA3 motif (variant 1,2,3,4,5,6,7,8):
percent peaks with or without GATA3 motif variants

Figure 3: percent peaks with or without GATA3 motif variants

For peaks in the top quantile intensity (top 5% highest intensity), we have ~16% peaks MEME/STREME are not able to find GATA3 binding elements.

We have added full DHS control and an independent DHS control.

Analysis steps:
1) Retrieved MCF7 DHS data (rep1 & rep2) from GSE29692 and used UCSC liftover to convert the data from hg19 to hg38.
2) Modified the DHS regions to 161bp windows.
3) Full DHS control: Applied the same MAST stringency to eliminate regions that overlapped with the 8 MEME/STREME de novo GATA3-like motifs previously identified in GATA3 ChIP peaks.
4) DHS regions removed GATA3 peak regions: use intersectBed to remove overlapped GATA3 ChIP-seq peak regions from the DHS regions, then applied the same MAST stringency to eliminate regions that overlapped with the 8 MEME/STREME de novo GATA3-like motifs previously identified in GATA3 ChIP peaks.
5) Coherence check: Employed MAST on the concatenated motif database against the regions with removed GATA3 motifs —no further regions were found.
6) Counted the regions with and without GATA3 motifs and added them to the bar plot.

Conclusion:
The left four bars are the MCF7 DHS controls. From left to right, they are full rep1 DHS regions, rep1 DHS regions that removed overlapped GATA3 ChIP peak regions, full rep2 DHS regions and rep2 DHS regions that removed overlapped GATA3 ChIPpeak regions.
Among the GATA3 peaks in the lowest intensity quantile (quantile 0.05), over 45% of the peaks contain GATA3 binding elements as identified by MEME/STREME.
The full DHS regions show a 27% overlap with GATA3 motifs in the 1st replicate and a 19% overlap in the 2nd replicate.
The DHS regions removed overlapped GATA3 peak regions has 18% (rep1) and 13.7% (rep2) peaks with GATA3 motifs.

This indicates that randomly we would expect a 14~18% regions has GATA3 motifs found. And GATA3 binding is more specific to regions within the genome (represented by GATA3 ChIP-seq peaks) rather than being uniformly distributed across all accessible DNA regions (represented by the DHS regions).
Consequently, we can infer that even within the GATA3 peaks falling within the lowest quantile, there are significant and meaningful binding sites for GATA3.

2.5 GATA3 ChIP peak without MEME/STREME found GATA3 motif

Background: we have GATA3 peak that have exhaustively searched for GATA3-like motif with MEME/STREME software. However, there are ~38.5% peak failed in finding a GATA3-like motif. Even for peak subset that have relatively high peak intensity (top5%), we still have ~16% peaks that MEME/STREME cannot find a GATA3-like motif.

Generally, we would expect to find binding sequences within the peak region, so that the ChIPed transcription factor (in our case, GATA3) can bind to. This is even more true for peaks with relatively high intensity.

Our question now became, is MEME/STREME limited in finding binding sequences patterned like GATA3 (that has fixed spacings between two 3mer)? Can we find the 3mer-3mer sequences in GATA3 peaks that do not contain the MEME/STREME found GATA3-like motifs?

2.6 Process the negative control

2.6.1 MCF7 DHS

There are four DHS negative controls: two reps of full MCF7 DHS regions (refer as “full DHS control”), two reps of MCF7 DHS regions without GATA3 ChIP peak regions (From here, I will refer this sets of control as “independent DHS control”).
All DHS negative controls are removed of the 8 GATA3-like motifs, because we want to compare them to the GATA3 peaks that MEME/STREME failed to find GATA3-like de novo motifs.

  • “full DHS control”
MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed  #92388
MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed  #139899
  • “independent DHS control”
MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed  #69119
MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed  #113774

2.7 Process the positive control

There are 8 GATA3-like motifs found by MEME/STREME. Among them, 5 of them have GATA3 binding motifs that are formed by two GAT (or ATC) 3mer and various spacings.

I will utilize MAST against the 161bp window peaks for all 8 motifs (the concatenated database file) to identify motif coordinates corresponding to each motif. Subsequently, I’ll employ intersectBed to isolate the peak sets containing the best motif within the 161bp window. The peak sets characterized by the presence of the specific motif structures that we are insterested in will serve as positive controls.

#GATA_ChIP_summit_161window.bed #the 161bp window around peak summit from the full GATA3 peak
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/
dir2=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/

fastaFromBed -fi $genome -bed ${dir}GATA_ChIP_summit_161window.bed -fo GATA_ChIP_summit_161window.fasta


mast -hit_list -best ${dir2}combined_output_meme.txt GATA_ChIP_summit_161window.fasta > mast_GATA3_PSWM_in_peaks_161win.txt 

wget https://raw.githubusercontent.com/sysunn/siyu_daily_update/main/December_2023/parse_multi_mast_to_coordinates.R
Rscript parse_multi_mast_to_coordinates.R mast_GATA3_PSWM_in_peaks_161win.txt  8
wc -l mast_GATA3_PSWM_in_peaks_161win.bed
head -3 mast_GATA3_PSWM_in_peaks_161win.bed
tail -3 mast_GATA3_PSWM_in_peaks_161win.bed
awk '{print > ("mast_GATA3_PSWM_in_peak_161win_motif_" $7 ".bed")}' mast_GATA3_PSWM_in_peaks_161win.bed
ls mast_GATA3_PSWM_in_peak_161win_motif*.bed
##    63124 mast_GATA3_PSWM_in_peaks_161win.bed
## chr1 1505049 1505056 1230.47 6.56e-05    +   1
## chr1 1883638 1883645 1230.47 6.56e-05    +   1
## chr1 5603508 5603515 1312.58 2.16e-05    -   1
## chrX 155296326   155296337   1713.86 1.94e-06    -   8
## chrY 786683  786694  1635.44 4.15e-06    -   8
## chrY 4437398 4437409 1439.4  1.66e-05    -   8
## mast_GATA3_PSWM_in_peak_161win_motif_1.bed
## mast_GATA3_PSWM_in_peak_161win_motif_2.bed
## mast_GATA3_PSWM_in_peak_161win_motif_3.bed
## mast_GATA3_PSWM_in_peak_161win_motif_4.bed
## mast_GATA3_PSWM_in_peak_161win_motif_5.bed
## mast_GATA3_PSWM_in_peak_161win_motif_7.bed
## mast_GATA3_PSWM_in_peak_161win_motif_8.bed
  • round1 motif: GAT—ATC
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_1.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_1.bed
wc -l GATA3_peak_161win_with_motif_1.bed # 7431 
head -5 GATA3_peak_161win_with_motif_1.bed
#chr1   1504941 1505102 GATA_ChIP_peak_56   16.8938
#chr1   1883513 1883674 GATA_ChIP_peak_76   36.5894
#chr1   5603376 5603537 GATA_ChIP_peak_106  2.64066
#chr1   6795832 6795993 GATA_ChIP_peak_132  646.195
#chr1   7549498 7549659 GATA_ChIP_peak_153  33.955
  • round2 motif: GAT—-ATC
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_2.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_2.bed
wc -l GATA3_peak_161win_with_motif_2.bed #9352 
head -5 GATA3_peak_161win_with_motif_2.bed
#chr1   917454  917615  GATA_ChIP_peak_32   75.0238
#chr1   6719989 6720150 GATA_ChIP_peak_129  10.6294
#chr1   6800848 6801009 GATA_ChIP_peak_133  261.961
#chr1   6959568 6959729 GATA_ChIP_peak_137  253.515
#chr1   7266109 7266270 GATA_ChIP_peak_145  33.955
  • round4 motif: GAT—–GAT
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_4.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_4.bed
wc -l GATA3_peak_161win_with_motif_4.bed # 4815
head -5 GATA3_peak_161win_with_motif_4.bed
#chr1   869417  869578  GATA_ChIP_peak_30   3.78065
#chr1   3186480 3186641 GATA_ChIP_peak_86   15.1046
#chr1   3717401 3717562 GATA_ChIP_peak_94   21.5896
#chr1   8626522 8626683 GATA_ChIP_peak_206  126.582
#chr1   8897599 8897760 GATA_ChIP_peak_229  33.955
  • round5 motif: ATC-GAT
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_5.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_5.bed
wc -l GATA3_peak_161win_with_motif_5.bed # 9180
head -5 GATA3_peak_161win_with_motif_5.bed
#chr1   869417  869578  GATA_ChIP_peak_30   3.78065
#chr1   917454  917615  GATA_ChIP_peak_32   75.0238
#chr1   1859356 1859517 GATA_ChIP_peak_75b  62.1909
#chr1   3926997 3927158 GATA_ChIP_peak_98   14.2308
#chr1   4659185 4659346 GATA_ChIP_peak_102  54.7665

GATA3_peak_161win_with_motif_1.bed
GATA3_peak_161win_with_motif_2.bed
GATA3_peak_161win_with_motif_4.bed
GATA3_peak_161win_with_motif_5.bed

2.7.1 barchart

When using mast to assess all 8 motifs concurrently, the -best option selectively assigns each peak region with only the most significant motif hit from the list of motif sites.

Previously, MEME identified a set of peaks (n=4510) containing motif6, which are now being associated with motifs other than motif6. Illustrating the fraction of peaks assigned with alternative motifs in this peak set via a bar chart would offer a visual representation of this transition.

module load meme/5.4.1
module load R/4.1.2
module load bedtools
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/

intersectBed -wa -a ${dir}without_motifs_12345.bed -b ${dir}mast_GATA3_PSWM_in_peaks_round6.bed > peak_101win_with_motif6.bed #4510
head peak
#chr1   6800878 6800979 GATA_ChIP_peak_133  261.961
#chr1   6959598 6959699 GATA_ChIP_peak_137  253.515
#chr1   8015116 8015217 GATA_ChIP_peak_161  378.26
#chr1   8312136 8312237 GATA_ChIP_peak_191  21.5896
#chr1   8421170 8421271 GATA_ChIP_peak_197  10.8826

sizes=/home/FCAM/ssun/Genome/hg38.chrom.sizes
slopBed -b 30 -i peak_101win_with_motif6.bed -g $sizes  | sort -k1,1 -k2,2n > peak_161win_with_motif6.bed #4510

The fourth column in the BED file represents peak indexes, with a total of 4510 peaks that were previously identified to contain motif6. Our current objective is to examine the specific weight matrices assigned to these peaks when subjected to a mast analysis involving 8 different motifs.

dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/
intersectBed -wa -wb -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | uniq | awk '{OFS="\t"} {print $1, $2, $3, $4, $12}' | uniq  > GATA3_peak_161win_with_all_mast_motif.bed #63127

# 63127 is larger than peak sets with motif 12345678 (59560), there might be peaks repeated because it has been assigned with more than 1 motif. 
# coherence check:
head -5 GATA3_peak_161win_with_all_mast_motif.bed
#chr1   869417  869578  GATA_ChIP_peak_30   4
#chr1   869417  869578  GATA_ChIP_peak_30   5
#chr1   917454  917615  GATA_ChIP_peak_32   5
#chr1   917454  917615  GATA_ChIP_peak_32   2
#chr1   996069  996230  GATA_ChIP_peak_35   8

awk '{OFS="\t"} {print $4}' GATA3_peak_161win_with_all_mast_motif.bed | uniq | wc -l #51122
# 51122 is less than 59560, which is expected, because we have used different mast stringency 

Two way to extract peak and motif information: one is use intersectBed to get overlapped peaks between peak sets previously found to contain motif 6 and peak sets that we mast against use the concatanated motif database (that we no longer see peak with motif6). Notice that if we use this method, we need to make sure that we use the 161win peak region (peak_161win_with_motif6.bed).

The other way is to use the peak index and awk to filter the rows in GATA3_peak_161win_with_all_mast_motif.bed based on whether their peak index (in the fourth column) is present in peak_101win_with_motif6.bed (peaks that previously found to contain motif6).

#method1
#intersectBed -wa -wb -a peak_101win_with_motif6.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | uniq | wc -l #4871
intersectBed -wa -wb -a peak_161win_with_motif6.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | uniq | wc -l #5323

#method2
awk 'FNR==NR{peaks[$4]; next} $4 in peaks' peak_101win_with_motif6.bed GATA3_peak_161win_with_all_mast_motif.bed > filtered_GATA3_peak_161win_with_all_mast_motif.bed #5323
head filtered_GATA3_peak_161win_with_all_mast_motif.bed
#chr1   6800848 6801009 GATA_ChIP_peak_133  2
#chr1   6959568 6959729 GATA_ChIP_peak_137  2
#chr1   8015086 8015247 GATA_ChIP_peak_161  2
#chr1   8312106 8312267 GATA_ChIP_peak_191  2
#chr1   8421140 8421301 GATA_ChIP_peak_197  2

#coherence check
#there are peaks assigned with more than 1 motif
awk '{OFS="\t"} {print $4}' filtered_GATA3_peak_161win_with_all_mast_motif.bed | uniq | wc -l #4510

Count numbers of peaks with different motifs and append to a txt summary file for plotting.
We will isolate peaks to different files based on whether they have the same motif (index, 5th column) assigned by mast.

awk '{print > ("filtered_GATA3_peak_161win_with_all_mast_motif_" $5 ".bed")}' filtered_GATA3_peak_161win_with_all_mast_motif.bed
for i in filtered_GATA3_peak_161win_with_all_mast_motif_*.bed
do
 wc -l $i
done

#197 filtered_GATA3_peak_161win_with_all_mast_motif_1.bed
#4420 filtered_GATA3_peak_161win_with_all_mast_motif_2.bed
#85 filtered_GATA3_peak_161win_with_all_mast_motif_3.bed
#220 filtered_GATA3_peak_161win_with_all_mast_motif_4.bed
#133 filtered_GATA3_peak_161win_with_all_mast_motif_5.bed
#121 filtered_GATA3_peak_161win_with_all_mast_motif_7.bed
#147 filtered_GATA3_peak_161win_with_all_mast_motif_8.bed
# Read the BED file
bed_data <- read.table("filtered_GATA3_peak_161win_with_all_mast_motif.bed", sep="\t", header=FALSE, stringsAsFactors=FALSE)

# Aggregate motif indices for each peak
aggregated_motifs <- aggregate(bed_data$V5, by = list(bed_data$V4), FUN = function(x) paste(unique(x), collapse = ","))

aggregated_motifs$nmotif <- sapply(aggregated_motifs$x, function(x) length(unlist(strsplit(x, ","))))
colnames(aggregated_motifs)=c("peak", "motif_indices", "nmotif")

head(aggregated_motifs)
#                   peak motif_indices nmotif
#1  GATA_ChIP_peak_10022             2      1
#2 GATA_ChIP_peak_10030b             2      1
#3  GATA_ChIP_peak_10031             2      1
#4   GATA_ChIP_peak_1004             2      1
#5 GATA_ChIP_peak_10041a           2,7      2
#6  GATA_ChIP_peak_10062             2      1
nrow(aggregated_motifs)
#[1] 4510
length(unique(aggregated_motifs$peak))
#[1] 4510
str(aggregated_motifs)
#'data.frame':  4510 obs. of  3 variables:
# $ peak         : chr  "GATA_ChIP_peak_10022" "GATA_ChIP_peak_10030b" "GATA_ChIP_peak_10031" "GATA_ChIP_peak_1004" ...
# $ motif_indices: chr  "2" "2" "2" "2" ...
# $ nmotif       : int  1 1 1 1 2 1 1 1 1 1 ...
library(lattice)


group_counts <- as.data.frame(table(aggregated_motifs$motif_indices))
names(group_counts) <- c('Group', 'Count')

summary(group_counts$Count)
#Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   1.00    1.00    1.00   67.31    7.50 3680.00 

group_counts$Group=factor(group_counts$Group, levels=c( "1", "2", "3", "4", "5",  "7", "8", "1,2", "1,4", "1,7", "1,8",  "2,1", "2,3",    
"2,4", "2,5", "2,7", "2,8", "3,1", "3,2",  "4,1", "4,5", "4,2", "5,2", "5,4", "5,7", "5,8", "7,1", "7,2", "8,2",  "8,5",  "8,7", "1,2,3",   "1,2,4",  " 1,2,5",  "1,2,7", "1,4,2", "1,4,3", "2,1,3", "2,1,4", "2,1,5", "2,1,8", "2,4,3", "2,4,5", "2,4,8", "2,7,4", "2,8,1",  "2,8,3", "3,2,5",   "3,2,8",   "3,4,2", "4,2,1",   "4,2,5",   "4,2,7",   "4,2,8", "4,8,2",  "5,2,1",   "5,2,3",   "5,2,7",  "7,2,3", "7,2,5", "7,2,8",   "7,4,1" , "8,1,2", "8,2,1", "8,2,5",   "8,4,2",  "4,1,2,7"))

# Create the bar plot
pdf('240110_peaks_with_GATA3_motif_variant.pdf',width=15,height=10)
#custom_labels <- seq(1, 4000, by = 20)
print(barchart(Count ~ Group,
         data= group_counts, 
         col = "skyblue", 
         ylim=c(0, 4000),
         xlab = "motif_indices", 
         ylab = "Number of Peaks",
         main = "Number of Peaks contain the motif",
         scales = list(x = list(rot = 45)), #y = list(at = custom_labels)
         horizontal = FALSE)
      )
dev.off()
peak_with_motif6 assigned with new motifs when mast with database

Figure 4: peak_with_motif6 assigned with new motifs when mast with database

The peak sets that used to defined as containing motif6 when mast single motif (motif6) against peaks without motif12345, are now assigned with other motifs when mast against entire GATA3 peak sets with all 8 motifs.

This suggest that GATA3 can bind to its binding element of different/various compositions.

3 GAT-GAT/ATC 3mer analaysis

  1. get all peaks (161bp window) without the 8 MEME/STREME motifs. Parse the peaks to 4 quantile based on peak intensity (use deseq2).

  2. for each quantile, as well as the two negative control (MCF7 DHS rep1, rep2), get the closest GAT to peak summit.

  3. find the second closest GAT.
    Make sure that the we want to use the relative distance between the two zinc fingers.
    and orientation.

3.1 Parse peak based on Intensity

peaks (161bp window) without 8 GATA3-motif (that found by MEME/STREME)

#cd /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8
wc -l without_motifs_123456_78_161bp_mast.bed #37308 
head -5 without_motifs_123456_78_161bp_mast.bed
#chr1   827300  827461  GATA_ChIP_peak_28   9.30478
#chr1   916689  916850  GATA_ChIP_peak_31   7.79887
#chr1   924773  924934  GATA_ChIP_peak_33   3.78065
#chr1   966573  966734  GATA_ChIP_peak_34   3.78065
#chr1   999428  999589  GATA_ChIP_peak_36   2.11515

3.1.1 Peak Intensity - bigWig and DEseq2

In the above file, the last column is the peak intensity reported by MACS3.
Notice that a peak with high MACS3-intensity is not necessarily a intense peak. There are two things we need to consider regarding peak intensities. The first is peak region; the second is dynamic range. Imaging a peak that is very intense but within a narrow range, the other peak is not so intense but can span its signals across a very long distance. We need to consider this region differences while calling any peak to be intense or not.

First use Sam Flag 0x3(reads paired and mapped in proper pair) to calculate the read depth for each GATA library.

# calculate the size factors 
module load samtools/1.12

dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/
for i in GATA
do
  echo $i
  > ${i}_header.txt
  > ${i}_reads.txt
  for j in ${dir}MCF7_dTAGGATA522*_${i}_*.sorted.bam
  do
    echo $j
    name=$(echo $j | awk -F $dir '{print $2}' | awk -F".sorted.bam" '{print $1}')
    echo $name | paste ${i}_header.txt - > ${i}_tmp.txt 
    mv ${i}_tmp.txt ${i}_header.txt
    reads=`samtools view -c -f 0x3 $j` #count the reads paired and mapped in proper pair
    echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt 
    mv ${i}_tmp.txt ${i}_reads.txt
  done  
  cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
  mv ${i}_tmp.txt ${i}_reads.txt
  rm ${i}_header.txt
done 
cat GATA_reads.txt
##  MCF7_dTAGGATA522_GATA_CC_rep1   MCF7_dTAGGATA522_GATA_CC_rep2   MCF7_dTAGGATA522_GATA_CC_rep3   MCF7_dTAGGATA522_GATA_CE_rep1   MCF7_dTAGGATA522_GATA_CE_rep2   MCF7_dTAGGATA522_GATA_CE_rep3   MCF7_dTAGGATA522_GATA_dE_rep1   MCF7_dTAGGATA522_GATA_dE_rep2   MCF7_dTAGGATA522_GATA_dE_rep3
##  33948616    32585396    34475586    51112588    147834968   136838760   34136142    136271358   85665512

bargraph

library(lattice)
df=as.data.frame(t(read.table("GATA_reads.txt", header=F)))
colnames(df)=c("library", "aligned_reads")
df$libraey=as.factor(df$library)
df$aligned_reads=as.numeric((df$aligned_reads))

barchart(aligned_reads ~ library, 
         data = df,
         ylim=c(0, max(df$aligned_reads)*1.04),
         col = "skyblue",
         scales = list(x = list(rot = 45)),
         xlab = "GATA3 ChIP library", 
         ylab = "concordantly aligned reads"
         )

This bar graph represents the read depth in each GATA3 ChIP-seq library. In the downstream analysis, we will need to use DESeq2 to normalize the counts in each library with size factor to account for this read depth difference.  

Get peak intensity within 400bp window withDeseq2 for peaks without MEME/STREME found motifs 12345678.

load peaks without motif 12345678.

#module load R/4.1.2
#R
a =  read.table('without_motifs_123456_78_161bp_mast.bed', sep = "\t", header=FALSE) 
nrow(a) #37308
## [1] 37308
head(a)
##     V1      V2      V3                V4      V5
## 1 chr1  827300  827461 GATA_ChIP_peak_28 9.30478
## 2 chr1  916689  916850 GATA_ChIP_peak_31 7.79887
## 3 chr1  924773  924934 GATA_ChIP_peak_33 3.78065
## 4 chr1  966573  966734 GATA_ChIP_peak_34 3.78065
## 5 chr1  999428  999589 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000456 1000617 GATA_ChIP_peak_37 5.52004

Increase the width to 400bp window.

library(bigWig)
peak.region.400win=center.bed(a, upstreamWindow = 200, downstreamWindow = 200)
nrow(peak.region.400win)
## [1] 37308
head(peak.region.400win)
##     V1      V2      V3                V4      V5
## 1 chr1  827180  827581 GATA_ChIP_peak_28 9.30478
## 2 chr1  916569  916970 GATA_ChIP_peak_31 7.79887
## 3 chr1  924653  925054 GATA_ChIP_peak_33 3.78065
## 4 chr1  966453  966854 GATA_ChIP_peak_34 3.78065
## 5 chr1  999308  999709 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000336 1000737 GATA_ChIP_peak_37 5.52004

In the below chunk, we define a function to get raw counts from each sample/reps. This function uses bed.region.bpQuery.bigWig from bigWig package. It requires a bed region file that has the peak region that we want to analysis (right now we want to analyse peak without motifs 123456789); it also requires non-normalized, raw bigWig files (generated directly from seqOutbias) to allow bed.region.bpQuery.bigWig to query the counts information from. The output from this function will be: rows will be the same as bed file, columns will be each bigWig library, entries will be the raw counts.

#functions on github
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')

#function
get.counts.interval <- function(df, path.to.bigWig, file.prefix = 'H') {
    vec.names = c()
    inten.df=data.frame(matrix(ncol = 0, nrow = nrow(df)))
    
    for (mod.bigWig in Sys.glob(file.path(path.to.bigWig, paste(file.prefix, "*.bigWig", sep ='')))) {
        factor.name = strsplit(strsplit(mod.bigWig, "/")[[1]][length(strsplit(mod.bigWig, "/")[[1]])], '.bigWig')[[1]][1]
        print(factor.name)
        vec.names = c(vec.names, factor.name)
        loaded.bw = load.bigWig(mod.bigWig)
        print(mod.bigWig)
        mod.inten = bed.region.bpQuery.bigWig(loaded.bw, df, abs.value = TRUE)
        inten.df = cbind(inten.df, mod.inten)
    }
    colnames(inten.df) = vec.names
    r.names = paste(df[,1], ':', df[,2], '-', df[,3], sep='')
    row.names(inten.df) = r.names
    return(inten.df)
}

We will first use the defined function to query raw counts from each non-normalized bigWig files of GATA ChIP-seq use peak region info loaded before (data frame “peak.region.400win”).

library(bigWig)
#non-normalized counts
GATA.counts.df= get.counts.interval(peak.region.400win, "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/bigWigs/Seqoutbias_bw","MCF") #25 libraries
#nrow(peak.region.400win)
#[1] 37308
#nrow(GATA.counts.df)
#[1] 37308
#head(GATA.counts.df)
#colnames(GATA.counts.df)


GATA.analysis.regions=GATA.counts.df[,grepl("_GATA_",colnames(GATA.counts.df))] # get non-normalized counts from "GATA" libraries
#colnames(GATA.analysis.regions)
#[1] "MCF7_dTAGGATA522_GATA_CC_rep1" "MCF7_dTAGGATA522_GATA_CC_rep2"
#[3] "MCF7_dTAGGATA522_GATA_CC_rep3" "MCF7_dTAGGATA522_GATA_CE_rep1"
#[5] "MCF7_dTAGGATA522_GATA_CE_rep2" "MCF7_dTAGGATA522_GATA_CE_rep3"
#[7] "MCF7_dTAGGATA522_GATA_dE_rep1" "MCF7_dTAGGATA522_GATA_dE_rep2"
#[9] "MCF7_dTAGGATA522_GATA_dE_rep3"
identical(rownames(GATA.analysis.regions),rownames(GATA.counts.df)) # [1] TRUE

Then we use the DESeq2 package to make a counts matrix (DESeqDataSetFromMatrix) and calculate size factors for each library (estimateSizeFactorsForMatrix) use the previously calculated read depth for each library (“GATA_reads.txt”); We use this size factor to normalize the counts (sizeFactors).
Then we use rowMeans to average the normalized counts for the three GATA_CC reps, and save it as “peak.intensities”; this normalized counts now can be use to determine if a peak is intense or not.

library(DESeq2)
sample.conditions = factor(sapply(strsplit(colnames(GATA.analysis.regions), '_rep'), '[', 1))
#[1] MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC
#[4] MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE
#[7] MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE
#3 Levels: MCF7_dTAGGATA522_GATA_CC ... MCF7_dTAGGATA522_GATA_dE
deseq.counts.table = DESeqDataSetFromMatrix(countData = GATA.analysis.regions, # DESeqDataSet needs countData to be non-negative integers; non-normalized counts are integer, normalized signals has decimals.
                colData = as.data.frame(sample.conditions),
                design = ~ sample.conditions)


GATA.SF <- read.table("GATA_reads.txt", sep = '\t', header = TRUE)[,-1] # GATA size factors from read depth
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
#                      33948616                      32585396
#  MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1
#                      34475586                      51112588
#  MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3
#                     147834968                     136838760
#  MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2
#                      34136142                     136271358
#  MCF7_dTAGGATA522_GATA_dE_rep3
#                      85665512
GATA.size.factors = estimateSizeFactorsForMatrix(GATA.SF) # read depth transformed to size factor 
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2 
#                    0.5385594                     0.5169334 
#MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1 
#                    0.5469193                     0.8108480 
#MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3 
#                    2.3452478                     2.1708044 
#MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2 
#                    0.5415343                     2.1618032 
#MCF7_dTAGGATA522_GATA_dE_rep3 
#                    1.3589941

                                       
sizeFactors(deseq.counts.table) <- GATA.size.factors # assign to each column of the count matrix (deseq.counts.table) the size factor to bring each column to a common scale
dds <- DESeq(deseq.counts.table)
normalized.counts.GATA3 = counts(dds, normalized=TRUE)
head(normalized.counts.GATA3)
peak.intensities = rowMeans(normalized.counts.GATA3[,1:3]) # we want to get the average read counts for CC groups
names(peak.intensities) = rownames(normalized.counts.GATA3)
save.image('240108_GATA3_ChIP_deseq.Rdata') 

Subset peaks without the 8 GATA3-motifs, each with 20% intensity quantile

We can load the saved Rdata and look at each dataframe.

module load R/4.1.2 
R 
load('240108_GATA3_ChIP_deseq.Rdata')

Read this .bed file into R, and use DeSeq2 to count read size and parse into different quantile

head(peak.intensities)
#chr1:827180-827581   chr1:916569-916970   chr1:924653-925054 
#            45.32469             36.69993             30.44684 
#  chr1:966453-966854   chr1:999308-999709 chr1:1000336-1000737 
#            26.87016             21.22455             24.34263 
length(peak.intensities)
#[1] 37308
quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))
# 20%        40%        60%        80%       100% 
#  31.78709   40.05304   51.95512   80.03307 6270.21535 

violin plot

This violin plot is showing the distribution of peak intensity (log10).

library(lattice)
log10quantiles <- quantile(log(abs(peak.intensities), base = 10), probs = seq(.20, 1.00, by = .20))

png('violinplot_GATA3_ChIP_peak_normalized_intensity.png')
print(bwplot(log(abs(peak.intensities), base = 10) ~ factor("1"), 
       main = "Violin-Like Plot",
       panel = function(x, ...) {
         panel.violin(x, ...)
         panel.abline(h = log10quantiles, col = "red", lty = 2)
       },
       xlab = "", ylab = "log10 normalized Intensity")
)
dev.off()
normalized GATA3 ChIP peak intensity

Figure 5: normalized GATA3 ChIP peak intensity

This violin plot represents the distribution and probability density of (log10) normalized peak intensity (for peak without motif 1~8). Each red dotted line indicates the quantile cutoff of 20%, which can help us to visualize the peak intensity distribution across different quantile.  

Parse peaks into 5 intensity quantile by 20%.

chr = sapply(strsplit(names(peak.intensities), ":"), "[", 1)
rnge = sapply(strsplit(names(peak.intensities), ":"), "[", 2)
start = as.numeric(sapply(strsplit(rnge, "-"), "[", 1)) + 200
end = as.numeric(sapply(strsplit(rnge, "-"), "[", 2)) - 200

quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))



# 1bp summit quantile file
j =0 
q=seq(.20, 1.00, by = .20)
count=0
for (i in quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))){
count = count +1

write.table(file = paste0('quantile', as.character(q[count]), '_summits.bed'), data.frame(chr[peak.intensities > j & peak.intensities <= i], start[peak.intensities > j & peak.intensities <= i], end[peak.intensities > j & peak.intensities <= i], peak.intensities[peak.intensities > j & peak.intensities <= i]), sep = '\t', quote=FALSE, col.names=FALSE, row.names=FALSE )
j = i
}
for i in quantile*.bed
do
wc -l $i
done
##     7269 quantile0.2_summits.bed
##     7461 quantile0.4_summits.bed
##     7462 quantile0.6_summits.bed
##     7461 quantile0.8_summits.bed
##     7462 quantile1_summits.bed

3.1.2 Coherence check

See if the number of input samples will change the normalized reads.
I am loading fewer samples (only two condition, GATA_CC and GATA_dE)

GATA.analysis.regions2=GATA.analysis.regions[,c(1,2,3,7,8,9)]
colnames(GATA.analysis.regions2)
#[1] "MCF7_dTAGGATA522_GATA_CC_rep1" "MCF7_dTAGGATA522_GATA_CC_rep2"
#[3] "MCF7_dTAGGATA522_GATA_CC_rep3" "MCF7_dTAGGATA522_GATA_dE_rep1"
#[5] "MCF7_dTAGGATA522_GATA_dE_rep2" "MCF7_dTAGGATA522_GATA_dE_rep3"
library(DESeq2)
sample.conditions2 = factor(sapply(strsplit(colnames(GATA.analysis.regions2), '_rep'), '[', 1))
sample.conditions2
#[1] MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC
#[4] MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE
#Levels: MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_dE

deseq.counts.table2 = DESeqDataSetFromMatrix(countData = GATA.analysis.regions2, # DESeqDataSet needs countData to be non-negative integers; non-normalized counts are integer, normalized signals has decimals.
                colData = as.data.frame(sample.conditions2),
                design = ~ sample.conditions2)


GATA.SF2 <- read.table("GATA_reads.txt", sep = '\t', header = TRUE)[,-1][,c(1,2,3,7,8,9)]
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
#                      33948616                      32585396
#  MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_dE_rep1
#                      34475586                      34136142
#  MCF7_dTAGGATA522_GATA_dE_rep2 MCF7_dTAGGATA522_GATA_dE_rep3
#                     136271358                      85665512
GATA.size.factors2 = estimateSizeFactorsForMatrix(GATA.SF2)
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2 
#                    0.6821163                     0.6547257 
#MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_dE_rep1 
#                    0.6927045                     0.6858842 
#MCF7_dTAGGATA522_GATA_dE_rep2 MCF7_dTAGGATA522_GATA_dE_rep3 
#                    2.7380474                     1.7212438 

sizeFactors(deseq.counts.table2) <- GATA.size.factors2 # assign to each column of the count matrix (deseq.counts.table) the size factor to bring each column to a common scale
dds <- DESeq(deseq.counts.table2)
normalized.counts.GATA3.2 = counts(dds, normalized=TRUE)
head(normalized.counts.GATA3.2)
peak.intensities2 = rowMeans(normalized.counts.GATA3.2[,1:3])

head(peak.intensities)
#  chr1:827180-827581   chr1:916569-916970   chr1:924653-925054 
#            45.32469             36.69993             30.44684 
#  chr1:966453-966854   chr1:999308-999709 chr1:1000336-1000737 
#            26.87016             21.22455             24.34263 
head(peak.intensities2)
#  chr1:827180-827581   chr1:916569-916970   chr1:924653-925054 
#            35.78574             28.97613             24.03906 
#  chr1:966453-966854   chr1:999308-999709 chr1:1000336-1000737 
#            21.21512             16.75767             19.21953 

save.image('240108_GATA3_ChIP_deseq_test.Rdata') 

The alteration in input samples affected the size factor determined by DESeq2’s estimateSizeFactorsForMatrix function, consequently impacting the normalized intensity of reads for each sample. However, I don’t believe it is altering the relative differences within each sample—such as which peak exhibits the highest intensity and which exhibits less.

Coherence check-follow up:

module load R/4.1.2 
R 
load('240108_GATA3_ChIP_deseq_test.Rdata')
chr = sapply(strsplit(names(peak.intensities2), ":"), "[", 1)
rnge = sapply(strsplit(names(peak.intensities2), ":"), "[", 2)
start = as.numeric(sapply(strsplit(rnge, "-"), "[", 1)) + 200
end = as.numeric(sapply(strsplit(rnge, "-"), "[", 2)) - 200

quantile(peak.intensities2, probs = seq(.20, 1.00, by = .20))



# 1bp summit quantile file
j =0 
q=seq(.20, 1.00, by = .20)
count=0
for (i in quantile(peak.intensities2, probs = seq(.20, 1.00, by = .20))){
count = count +1

write.table(file = paste0('test_quantile', as.character(q[count]), '_summits.bed'), data.frame(chr[peak.intensities2 > j & peak.intensities2 <= i], start[peak.intensities2 > j & peak.intensities2 <= i], end[peak.intensities2 > j & peak.intensities2 <= i], peak.intensities2[peak.intensities2 > j & peak.intensities2 <= i]), sep = '\t', quote=FALSE, col.names=FALSE, row.names=FALSE )
j = i
}

See besides the normalized intensity, if peak coordinates assigned to each quantile is the same.

for i in test_quantile*.bed
do
wc -l $i
done
#7269 test_quantile0.2_summits.bed
#7461 test_quantile0.4_summits.bed
#7462 test_quantile0.6_summits.bed
#7461 test_quantile0.8_summits.bed
#7462 test_quantile1_summits.bed

awk '{OFS="\t"} {print $1, $2, $3}' quantile1_summits.bed > test1.bed
awk '{OFS="\t"} {print $1, $2, $3}' test_quantile1_summits.bed > test2.bed
diff test1.bed test2.bed 

No difference between the peak regions assigned to each quantile files, despite the difference in the normalized counts.

3.2 Closest GAT to peak summit

In this section, we use bedtools closestBed (refer to: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to find the closest GAT to each provided peak summit.
Input:
Input2 -a is the sorted peak summit file (centered 1bp);
Input2 -b is the sorted, and concatenated GAT coordinates file (both plus and minus);

3.2.1 GAT coordinates file

GAT coordinates on full hg38 use read size==1000.

dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/
head ${dir}hg38.3.3.3minus.14_GAT.bed  
head ${dir}hg38.3.3.3plus.36_GAT.bed

head ${dir}hg38.3.3.3minus.36_ATC.bed  
head ${dir}hg38.3.3.3plus.14_ATC.bed

we will concatanate the plus and minus 3mer file together.

cat ${dir}hg38.3.3.3minus.14_GAT.bed ${dir}hg38.3.3.3plus.36_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.bed
cat ${dir}hg38.3.3.3minus.36_ATC.bed ${dir}hg38.3.3.3plus.14_ATC.bed > hg38.3.3.3.30_plus_minus_ATC.bed

wc -l ${dir}hg38.3.3.3minus.14_GAT.bed #38717592
wc -l ${dir}hg38.3.3.3plus.36_GAT.bed #39090736
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #77808328
#38717592+39090736
#[1] 77808328

wc -l ${dir}hg38.3.3.3minus.36_ATC.bed #39086237
wc -l ${dir}hg38.3.3.3plus.14_ATC.bed #38725373
wc -l hg38.3.3.3.30_plus_minus_ATC.bed #77811610

#39086237+38725373
#[1] 77811610
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #77808328
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1   10545   10548   14  14  -   GAT
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   11597   11600   14  14  -   GAT

load files contains 3mer coordinates info

#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
#    V1    V2    V3 V4 V5 V6  V7
#
tail(all.GAT.file)
#                             V1    V2    V3 V4 V5 V6  V7
#

nrow(all.GAT.file)
#[1] 

3.2.2 20% quantile peaks without motifs 12345678

load files contains selected ChIP peak summit info

chip.peak.summit.quantile1=read.table("quantile1_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile1)
head(chip.peak.summit.quantile1)

chip.peak.summit.quantile0.8=read.table("quantile0.8_summits.bed", header=FALSE)
chip.peak.summit.quantile0.6=read.table("quantile0.6_summits.bed", header=FALSE)
chip.peak.summit.quantile0.4=read.table("quantile0.4_summits.bed", header=FALSE)
chip.peak.summit.quantile0.2=read.table("quantile0.2_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile0.8)
head(chip.peak.summit.quantile0.8)
summary(chip.peak.summit.quantile0.8$V4)
nrow(chip.peak.summit.quantile0.6)
head(chip.peak.summit.quantile0.6)
summary(chip.peak.summit.quantile0.6$V4)
nrow(chip.peak.summit.quantile0.4)
head(chip.peak.summit.quantile0.4)
summary(chip.peak.summit.quantile0.4$V4)
nrow(chip.peak.summit.quantile0.2)
head(chip.peak.summit.quantile0.2)
summary(chip.peak.summit.quantile0.2$V4)

bedtools closestBed
The below function will sort input bed1 and bed2 first, then run bedtools closestBed between bed1 and bed2.

# define function 
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
  
  options(scipen =99) # not use scientific notation when writing out
  
  #write bed formatted data.frames to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  # the command sort a and b file by coordinates
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  # the command call closestBed on bed1 and bed2
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='')
  
  # remove intermediate files
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
  return(res)
}

Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.

Find the closest GAT to peak summit regardless of GAT strandedness

closest.1stGAT.to.GATA.peak.quantile1=bedTools.closest(bed1 = chip.peak.summit.quantile1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.8=bedTools.closest(bed1 = chip.peak.summit.quantile0.8[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.6=bedTools.closest(bed1 = chip.peak.summit.quantile0.6[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.4=bedTools.closest(bed1 = chip.peak.summit.quantile0.4[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.2=bedTools.closest(bed1 = chip.peak.summit.quantile0.2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')


write.table(closest.1stGAT.to.GATA.peak.quantile1,file= 'closest.1stGAT.to.GATA.peak.quantile1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.8,file= 'closest.1stGAT.to.GATA.peak.quantile0.8.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.6,file= 'closest.1stGAT.to.GATA.peak.quantile0.6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.4,file= 'closest.1stGAT.to.GATA.peak.quantile0.4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.2,file= 'closest.1stGAT.to.GATA.peak.quantile0.2.bed', quote=F,sep="\t",col.names=F,row.names=F)


save.image('240108_GATA3_ChIP_clsestbed.Rdata') 

3.2.3 DHS negative controls

load files contains the 4 MCF7 DHS negative controls.

  • “full DHS control”

MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed #92388
MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed #139899

  • “independent DHS control”

MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed #69119
MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed #113774

# DHS negative ctrl (convert to 1bp summit at center)
library(bigWig)
full.DHS.control.rep1=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(full.DHS.control.rep1)
head(full.DHS.control.rep1)

full.DHS.control.rep2=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(full.DHS.control.rep2)
head(full.DHS.control.rep2)

indep.DHS.control.rep1=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.rep1)
head(indep.DHS.control.rep1)

indep.DHS.control.rep2=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.rep2)
head(indep.DHS.control.rep2)
closest.1stGAT.to.full.DHS.control.rep1=bedTools.closest(bed1 = full.DHS.control.rep1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.full.DHS.control.rep2=bedTools.closest(bed1 = full.DHS.control.rep2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.indep.DHS.control.rep1=bedTools.closest(bed1 = indep.DHS.control.rep1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.indep.DHS.control.rep2=bedTools.closest(bed1 = indep.DHS.control.rep2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')


write.table(closest.1stGAT.to.full.DHS.control.rep1,file= 'closest.1stGAT.to.full.DHS.control.rep1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.full.DHS.control.rep2,file= 'closest.1stGAT.to.full.DHS.control.rep2.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.indep.DHS.control.rep1,file= 'closest.1stGAT.to.indep.DHS.control.rep1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.indep.DHS.control.rep2,file= 'closest.1stGAT.to.indep.DHS.control.rep2.bed', quote=F,sep="\t",col.names=F,row.names=F)


save.image('240108_GATA3_ChIP_clsestbed2.Rdata') 

3.2.4 positive controls

load files contains the 4 positive controls (GATA3 peak set with enriched GATA3-like motif of known structures).

GATA3_peak_161win_with_motif_1.bed
GATA3_peak_161win_with_motif_2.bed
GATA3_peak_161win_with_motif_4.bed
GATA3_peak_161win_with_motif_5.bed

# positive control
library(bigWig)
pos.control.peak.with1=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_1.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with1)
head(pos.control.peak.with1)

pos.control.peak.with2=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_2.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with2)
head(pos.control.peak.with2)

pos.control.peak.with4=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_4.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with4)
head(pos.control.peak.with4)

pos.control.peak.with5=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_5.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with5)
head(pos.control.peak.with5)
closest.1stGAT.to.pos.control.motif1=bedTools.closest(bed1 = pos.control.peak.with1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.pos.control.motif2=bedTools.closest(bed1 = pos.control.peak.with2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.pos.control.motif4=bedTools.closest(bed1 = pos.control.peak.with4[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.pos.control.motif5=bedTools.closest(bed1 = pos.control.peak.with5[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')


write.table(closest.1stGAT.to.pos.control.motif1,file= 'closest.1stGAT.to.pos.control.motif1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.pos.control.motif2,file= 'closest.1stGAT.to.pos.control.motif2.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.pos.control.motif4,file= 'closest.1stGAT.to.pos.control.motif4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.pos.control.motif5,file= 'closest.1stGAT.to.pos.control.motif5.bed', quote=F,sep="\t",col.names=F,row.names=F)


save.image('240108_GATA3_ChIP_clsestbed3.Rdata') 

3.2.5 CDFs

We are generating a cumulative distribution function to assess the proximity of a 3mer GAT to the summit of GATA3 ChIP peaks, potentially utilizing 3mer GAT as a binding sequence, across five quantile files. This evaluation aims to determine if the 3mer GAT sequences are closer to the summit of GATA3 ChIP peaks than they are to the summit of the MCF7 DHS regions, which consist of four files. Additionally, we possess four positive control peaks containing GATA3 motifs with known structures.

3.2.5.1 prepare plotting data frame

In previous analysis, we have used closestBed to find the closest GAT to peak summit (5 quantile peak sets, 4 MCF7 DHS negative control, and 4 positive controls).

GATA3 peaks
closest.1stGAT.to.GATA.peak.quantile0.2.bed
closest.1stGAT.to.GATA.peak.quantile0.4.bed
closest.1stGAT.to.GATA.peak.quantile0.6.bed
closest.1stGAT.to.GATA.peak.quantile0.8.bed
closest.1stGAT.to.GATA.peak.quantile1.bed

These files saved info of 1st closest GAT to GATA3 peak summits. These GATA3 peaks are peaks without MEME/MOTIF found motifs and have ranked by peak intensity.

Notice that V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.

closest.1stGAT.to.GATA.peak.quantile1=read.table('closest.1stGAT.to.GATA.peak.quantile1.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
closest.1stGAT.to.GATA.peak.quantile0.8=read.table('closest.1stGAT.to.GATA.peak.quantile0.8.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
closest.1stGAT.to.GATA.peak.quantile0.6=read.table('closest.1stGAT.to.GATA.peak.quantile0.6.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
closest.1stGAT.to.GATA.peak.quantile0.4=read.table('closest.1stGAT.to.GATA.peak.quantile0.4.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
closest.1stGAT.to.GATA.peak.quantile0.2=read.table('closest.1stGAT.to.GATA.peak.quantile0.2.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4

negative control
closest.1stGAT.to.full.DHS.control.rep1.bed
closest.1stGAT.to.indep.DHS.control.rep1.bed
closest.1stGAT.to.full.DHS.control.rep2.bed
closest.1stGAT.to.indep.DHS.control.rep2.bed

closest.1stGAT.to.full.DHS.control.rep1=read.table('closest.1stGAT.to.full.DHS.control.rep1.bed',header=F, comment.char='')
closest.1stGAT.to.full.DHS.control.rep2=read.table('closest.1stGAT.to.full.DHS.control.rep2.bed',header=F, comment.char='')


closest.1stGAT.to.indep.DHS.control.rep1=read.table('closest.1stGAT.to.indep.DHS.control.rep1.bed',header=F, comment.char='')
closest.1stGAT.to.indep.DHS.control.rep2=read.table('closest.1stGAT.to.indep.DHS.control.rep2.bed',header=F, comment.char='')


head(closest.1stGAT.to.full.DHS.control.rep1)
##     V1     V2     V3   V4     V5     V6 V7 V8 V9 V10 V11
## 1 chr1 268044 268045 chr1 268046 268049 36 36  + GAT   2
## 2 chr1 586135 586136 chr1 586125 586128 36 36  + GAT   8
## 3 chr1 629975 629976 chr1 630013 630016 14 14  - GAT  38
## 4 chr1 630235 630236 chr1 630266 630269 14 14  - GAT  31
## 5 chr1 630555 630556 chr1 630559 630562 14 14  - GAT   4
## 6 chr1 631455 631456 chr1 631414 631417 14 14  - GAT  39
nrow(closest.1stGAT.to.full.DHS.control.rep1)
## [1] 92388
nrow(closest.1stGAT.to.full.DHS.control.rep1[closest.1stGAT.to.full.DHS.control.rep1$V9=="+",])
## [1] 46316
nrow(closest.1stGAT.to.full.DHS.control.rep1[closest.1stGAT.to.full.DHS.control.rep1$V9=="-",])
## [1] 46072

positive control
closest.1stGAT.to.pos.control.motif1.bed
closest.1stGAT.to.pos.control.motif2.bed
closest.1stGAT.to.pos.control.motif4.bed
closest.1stGAT.to.pos.control.motif5.bed

closest.1stGAT.to.pos.control.motif1=read.table('closest.1stGAT.to.pos.control.motif1.bed',header=F, comment.char='')
closest.1stGAT.to.pos.control.motif2=read.table('closest.1stGAT.to.pos.control.motif2.bed',header=F, comment.char='')
closest.1stGAT.to.pos.control.motif4=read.table('closest.1stGAT.to.pos.control.motif4.bed',header=F, comment.char='')
closest.1stGAT.to.pos.control.motif5=read.table('closest.1stGAT.to.pos.control.motif5.bed',header=F, comment.char='')

head(closest.1stGAT.to.pos.control.motif1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 1505021 1505022 chr1 1505049 1505052 36 36  + GAT  28
## 2 chr1 1883593 1883594 chr1 1883591 1883594 14 14  - GAT   0
## 3 chr1 5603456 5603457 chr1 5603454 5603457 14 14  - GAT   0
## 4 chr1 6795912 6795913 chr1 6795908 6795911 36 36  + GAT   2
## 5 chr1 7549578 7549579 chr1 7549563 7549566 36 36  + GAT  13
## 6 chr1 8289429 8289430 chr1 8289430 8289433 36 36  + GAT   1
nrow(closest.1stGAT.to.pos.control.motif1)
## [1] 7431
nrow(closest.1stGAT.to.pos.control.motif1[closest.1stGAT.to.pos.control.motif1$V9=="+",])
## [1] 3698
nrow(closest.1stGAT.to.pos.control.motif1[closest.1stGAT.to.pos.control.motif1$V9=="-",])
## [1] 3733

combine data

GATA3 peak info + distance + status.

df.chip = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.chip) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./closest.1stGAT.to.GATA.peak.quantile*.bed"))) {
    print(chip.peak)
    quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.GATA.peak.')[[1]][2]), ".bed")[[1]][1]
    print(quantile.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], quantile.name)
    df.chip = rbind(df.chip, all.distance)
}
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./closest.1stGAT.to.GATA.peak.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame':    37115 obs. of  5 variables:
##  $ V1           : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2           : int  924853 966653 999508 1000536 1001891 1020187 1069549 1079599 1163013 1163246 ...
##  $ V3           : int  924854 966654 999509 1000537 1001892 1020188 1069550 1079600 1163014 1163247 ...
##  $ V11          : int  0 65 61 91 46 143 37 29 213 120 ...
##  $ quantile.name: chr  "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"
colnames(df.chip) = c("V1","V2","V3", "dis", "status")
head(df.chip)
##     V1      V2      V3 dis      status
## 1 chr1  924853  924854   0 quantile0.2
## 2 chr1  966653  966654  65 quantile0.2
## 3 chr1  999508  999509  61 quantile0.2
## 4 chr1 1000536 1000537  91 quantile0.2
## 5 chr1 1001891 1001892  46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2

negative control (MCF7 DHS regions) info + distance +status.

df.neg = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.neg) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./*DHS.control.rep*.bed"))) {
    print(chip.peak)
    rep.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.')[[1]][2]), ".bed")[[1]][1]
    print(rep.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)],  rep.name) 
    df.neg = rbind(df.neg, all.distance)
}
## [1] "./closest.1stGAT.to.full.DHS.control.rep1.bed"
## [1] "full.DHS.control.rep1"
## [1] "./closest.1stGAT.to.full.DHS.control.rep2.bed"
## [1] "full.DHS.control.rep2"
## [1] "./closest.1stGAT.to.indep.DHS.control.rep1.bed"
## [1] "indep.DHS.control.rep1"
## [1] "./closest.1stGAT.to.indep.DHS.control.rep2.bed"
## [1] "indep.DHS.control.rep2"
str(df.neg)
## 'data.frame':    415180 obs. of  5 variables:
##  $ V1      : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2      : int  268044 586135 629975 630235 630555 631455 631735 632315 633015 634475 ...
##  $ V3      : int  268045 586136 629976 630236 630556 631456 631736 632316 633016 634476 ...
##  $ V11     : int  2 8 38 31 4 39 44 4 11 0 ...
##  $ rep.name: chr  "full.DHS.control.rep1" "full.DHS.control.rep1" "full.DHS.control.rep1" "full.DHS.control.rep1" ...
colnames(df.neg) = c("V1","V2","V3", "dis", "status")
unique(df.neg$status)
## [1] "full.DHS.control.rep1"  "full.DHS.control.rep2"  "indep.DHS.control.rep1"
## [4] "indep.DHS.control.rep2"
head(df.neg)
##     V1     V2     V3 dis                status
## 1 chr1 268044 268045   2 full.DHS.control.rep1
## 2 chr1 586135 586136   8 full.DHS.control.rep1
## 3 chr1 629975 629976  38 full.DHS.control.rep1
## 4 chr1 630235 630236  31 full.DHS.control.rep1
## 5 chr1 630555 630556   4 full.DHS.control.rep1
## 6 chr1 631455 631456  39 full.DHS.control.rep1

positive control info + distance +status.

df.pos = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.pos) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./closest.1stGAT.to.pos.control*.bed"))) {
    print(chip.peak)
    rep.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.')[[1]][2]), ".bed")[[1]][1]
    print(rep.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)],  rep.name) 
    df.pos = rbind(df.pos, all.distance)
}
## [1] "./closest.1stGAT.to.pos.control.motif1.bed"
## [1] "pos.control.motif1"
## [1] "./closest.1stGAT.to.pos.control.motif2.bed"
## [1] "pos.control.motif2"
## [1] "./closest.1stGAT.to.pos.control.motif4.bed"
## [1] "pos.control.motif4"
## [1] "./closest.1stGAT.to.pos.control.motif5.bed"
## [1] "pos.control.motif5"
str(df.pos)
## 'data.frame':    30778 obs. of  5 variables:
##  $ V1      : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2      : int  1505021 1883593 5603456 6795912 7549578 8289429 8395067 8606890 8635981 8641964 ...
##  $ V3      : int  1505022 1883594 5603457 6795913 7549579 8289430 8395068 8606891 8635982 8641965 ...
##  $ V11     : int  28 0 0 2 13 1 8 0 5 9 ...
##  $ rep.name: chr  "pos.control.motif1" "pos.control.motif1" "pos.control.motif1" "pos.control.motif1" ...
colnames(df.pos) = c("V1","V2","V3", "dis", "status")
unique(df.pos$status)
## [1] "pos.control.motif1" "pos.control.motif2" "pos.control.motif4"
## [4] "pos.control.motif5"
head(df.pos)
##     V1      V2      V3 dis             status
## 1 chr1 1505021 1505022  28 pos.control.motif1
## 2 chr1 1883593 1883594   0 pos.control.motif1
## 3 chr1 5603456 5603457   0 pos.control.motif1
## 4 chr1 6795912 6795913   2 pos.control.motif1
## 5 chr1 7549578 7549579  13 pos.control.motif1
## 6 chr1 8289429 8289430   1 pos.control.motif1

merge files.

df.all=rbind(df.chip, df.neg, df.pos)
unique(df.all$status)
##  [1] "quantile0.2"            "quantile0.4"            "quantile0.6"           
##  [4] "quantile0.8"            "quantile1"              "full.DHS.control.rep1" 
##  [7] "full.DHS.control.rep2"  "indep.DHS.control.rep1" "indep.DHS.control.rep2"
## [10] "pos.control.motif1"     "pos.control.motif2"     "pos.control.motif4"    
## [13] "pos.control.motif5"
df.all$status = factor(df.all$status, levels = c("quantile0.2", "quantile0.4", "quantile0.6", "quantile0.8", "quantile1", "full.DHS.control.rep1", "full.DHS.control.rep2", "indep.DHS.control.rep1", "indep.DHS.control.rep2", "pos.control.motif1", "pos.control.motif2", "pos.control.motif4", "pos.control.motif5"))
head(df.all)
##     V1      V2      V3 dis      status
## 1 chr1  924853  924854   0 quantile0.2
## 2 chr1  966653  966654  65 quantile0.2
## 3 chr1  999508  999509  61 quantile0.2
## 4 chr1 1000536 1000537  91 quantile0.2
## 5 chr1 1001891 1001892  46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
nrow(df.all)
## [1] 483073

3.2.5.2 CDF plot

negative control comparison
First, we want to ask if the MCF7 DHS controls comparable to each other.

library(lattice)
library(latticeExtra)

df.neg$status = factor(df.neg$status, levels = c("full.DHS.control.rep1", "full.DHS.control.rep2", "indep.DHS.control.rep1", "indep.DHS.control.rep2"))

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.neg,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("blue4","blue"))(2), colorRampPalette(c("grey60","grey30"))(2)),
         aspect = 1,
         #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,2.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("blue4","blue"))(2), colorRampPalette(c("grey60","grey30"))(2)), lwd=3), strip.background=list(col="grey85"))
         )

The blue traces are the two reps from full MCF7 DHS regions, the grey traces are the two reps from MCF7 HDS regions without overlapped GATA3 peak regions.
Four negative traces looks very similar to each other.

all in one plot

library(lattice)
library(latticeExtra)

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", colorRampPalette(c("grey85","grey30"))(4), colorRampPalette(c("green","darkgreen"))(4)),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,2.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", colorRampPalette(c("grey85","grey30"))(4), colorRampPalette(c("green","darkgreen"))(4)), lwd=3), strip.background=list(col="grey85"))
         #panel = function(...) {
         #    panel.abline(v= 1.255, lty =2) #log10(17)=1.23 #log10(17)=1.255
         #    panel.ecdfplot(...)
         #}
)

The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (grey: MCF7 DHS regions; green: positive control).

3.2.5.3 empirically determine binding distance for each quantile–quantile1

We can empirically determine the distance at which the CDFs’ difference between GATA3 peak set and ctrl peak set reach the plateaus (where CDF traces became parallel or converging).

unique(df.all$status)
##  [1] quantile0.2            quantile0.4            quantile0.6           
##  [4] quantile0.8            quantile1              full.DHS.control.rep1 
##  [7] full.DHS.control.rep2  indep.DHS.control.rep1 indep.DHS.control.rep2
## [10] pos.control.motif1     pos.control.motif2     pos.control.motif4    
## [13] pos.control.motif5    
## 13 Levels: quantile0.2 quantile0.4 quantile0.6 quantile0.8 ... pos.control.motif5

Comparing between MCF7 DHS regions (neg control) and GATA3 peak in quantile1 to determined the converging/parallel distance cutoff at ~13bp.

There are 4 negative control, we need to determine one by one:

levels(df.neg$status)
## [1] "full.DHS.control.rep1"  "full.DHS.control.rep2"  "indep.DHS.control.rep1"
## [4] "indep.DHS.control.rep2"

1) quantile1 vs. “full.DHS.control.rep1”

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "full.DHS.control.rep1"])
# x[1:1535] =      0,      1,      2,  ..., 7.1132e+05, 7.3298e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile1"])
# x[1:187] =      0,      1,      2,  ...,    359,    425
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 17; 
## [1] 17
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

By plotting the difference between the CDFs of GATA3 peaks (quantile 1) and the negative control (full.DHS.control.rep1), we can pinpoint the point where these two CDF curves converge or reach a similar trajectory.

When comparing GATA3 peaks within quantile 1 to the full.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 17 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg[df.neg$status=="full.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile1", "full.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "blue4"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "blue4"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 17, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 17, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (17%):

#df.chip[df.chip$status=="quantile1",]
#df.neg[df.neg$status=="full.DHS.control.rep1",]

frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 17) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis <= 17) / length(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.1723328

2) quantile1 vs. “full.DHS.control.rep2”

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "full.DHS.control.rep2"])
# x[1:2283] =      0,      1,      2,  ..., 7.33e+05, 7.3424e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile1"])
# x[1:187] =      0,      1,      2,  ...,    359,    425
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 18; 
## [1] 18
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 1 to the full.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 18 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg[df.neg$status=="full.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile1", "full.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "blue"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "blue"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 18, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 18, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (19.8%):

#df.chip[df.chip$status=="quantile1",]
#df.neg[df.neg$status=="full.DHS.control.rep1",]

frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 18) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis <= 18) / length(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.1985438

3) quantile1 vs. “indep.DHS.control.rep1”

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "indep.DHS.control.rep1"])
# x[1:1449] =      0,      1,      2,  ..., 7.1132e+05, 7.3298e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile1"])
# x[1:187] =      0,      1,      2,  ...,    359,    425
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 17; 
## [1] 17
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 1 to the indep.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 17 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg[df.neg$status=="indep.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile1", "indep.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 17, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 17, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (17.6%):

#df.chip[df.chip$status=="quantile1",]
#df.neg[df.neg$status=="full.DHS.control.rep1",]

frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 17) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis <= 17) / length(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.1759796
  1. “indep.DHS.control.rep2”
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "indep.DHS.control.rep2"])
# x[1:2207] =      0,      1,      2,  ..., 7.33e+05, 7.3424e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile1"])
# x[1:187] =      0,      1,      2,  ...,    359,    425
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 18; 
## [1] 18
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 1 to the indep.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 18 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg[df.neg$status=="indep.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile1", "indep.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 18, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 17, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (20%):

#df.chip[df.chip$status=="quantile1",]
#df.neg[df.neg$status=="full.DHS.control.rep1",]

frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 17) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis <= 17) / length(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.2000822

3.2.5.4 empirically determine binding distance for each quantile–quantile0.8

1) quantile0.8 vs. “full.DHS.control.rep1”:
2) quantile0.8 vs. “full.DHS.control.rep2”:
3) quantile0.8 vs. “indep.DHS.control.rep1”:
4) quantile0.8 vs. “indep.DHS.control.rep2”:

3.2.5.5 empirically determine binding distance for each quantile–quantile0.6

1) quantile0.6 vs. “full.DHS.control.rep1”:
2) quantile0.6 vs. “full.DHS.control.rep2”:
3) quantile0.6 vs. “indep.DHS.control.rep1”:
4) quantile0.6 vs. “indep.DHS.control.rep2”:

3.2.5.6 empirically determine binding distance for each quantile–quantile0.4

1) quantile0.4 vs. “full.DHS.control.rep1”:
2) quantile0.4 vs. “full.DHS.control.rep2”:
3) quantile0.4 vs. “indep.DHS.control.rep1”:
4) quantile0.4 vs. “indep.DHS.control.rep2”:

3.2.5.7 empirically determine binding distance for each quantile–quantile0.2

1) quantile0.2 vs. “full.DHS.control.rep1”:
2) quantile0.2 vs. “full.DHS.control.rep2”:
3) quantile0.2 vs. “indep.DHS.control.rep1”:
4) quantile0.2 vs. “indep.DHS.control.rep2”:

3.2.6 barchart summary

Summarize the fraction of peaks (in each quantile) that has a closest GAT to its peak summits.

3.3 find the second closest GAT

In this section, we need to consider two things:
One is to define spacing/distance as the relative position of two zinc finger;
the other is to separate the cases with different strand orientation.

4 follow-up

4.1 MACS3 called peaks:

  •   Is IgG read depth comparable to GATA3 read depth? –YES \
#cd GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/read_depth/
#module load R/4.1.2
#R 
library("lattice") 
df_GATA=read.csv('GATA_sorted_bam_reads.csv')
df_IgG=read.csv('IgG_sorted_bam_reads.csv')

df_sum=rbind(df_GATA[1:3,], df_IgG)
df_sum$supp=c(rep("GATA",3), rep("IgG", 4))
df_sum$rep=c(rep(c(1,2,3),2 ),4)
df_sum$supp2=as.factor(paste(df_sum$treatment,df_sum$rep, sep = "_"))

pdf('barplot_sorted_bam_GATA_IgG_reads1.pdf',width=6,height=8)
print(barchart(aligned_reads~ supp,
         group=treatment,
         stack=T,
         data = df_sum,
         #auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         xlab = "libraries",
         ylab = "post-alignment reads (sorted_bam)",
))
dev.off()

pdf('barplot_sorted_bam_GATA_IgG_reads.pdf',width=10,height=8)
print(barchart(aligned_reads~ supp2|supp,
         group=treatment,
         stack=T,
         data = df_sum,
         #auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         xlab = "libraries",
         ylab = "post-alignment reads (sorted_bam)",
))
dev.off()
stacked bar plot of GATA3 libraries and IgG libraries

Figure 6: stacked bar plot of GATA3 libraries and IgG libraries

The read depth in most libraries between GATA3 and IgG is comparable. However, IgG CC_rep2 exhibits higher read depth due to its sequencing in the second pool, which contains fewer samples.

  •   Saturation curve (reads vs. number of peaks) \
  1. Randomly subset the GATA3 ChIP-seq data from 10% to 100%, with three replicates for each percentage.
  2. Randomly subset the control IgG ChIP-seq data from 10% to 100%, with four replicates for each percentage.
  3. Utilize MACS3 with the same parameters used for calling GATA3 peaks previously to detect peaks for each subsetted percentage.
  4. Count the number of peaks.
  5. Construct a saturation curve and fit it to an asymptotic regression model.
#! /bin/sh
#SBATCH --job-name=saturation_curve.sh    
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p general
#SBATCH --qos=general
#SBATCH --mem=128G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o saturation_curve.sh_%j.out
#SBATCH -e saturation_curve.sh_%j.err
hostname
mkdir temp_macs

module load macs3
module load samtools/1.12

directory=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/

# Define subsampling percentages
subsample_percents=$(seq 10 10 100)

for subsample_percent in $subsample_percents
do
    subsample_val=$(echo "scale=2; $subsample_percent / 100" | bc)
    
    # Sub-sample treatment BAM files
    for treatment_file in ${directory}*_GATA_CC*sorted.bam
    do
        nm=$(echo $treatment_file | awk -F"/" '{print $NF}' | awk -F"MCF7_dTAGGATA522_" '{print $2}')
        samtools view -s $subsample_val -b $treatment_file > subsampled_${subsample_val}_${nm}
    done
    
    # Sub-sample control BAM files
    for control_file in ${directory}*IgG*sorted.bam
    do
        name=$(echo $control_file | awk -F"/" '{print $NF}' | awk -F"MCF7_dTAGGATA522_" '{print $2}')
        samtools view -s $subsample_val -b $control_file > subsampled_${subsample_val}_${name}
    done
    
    # Call peaks using MACS3 with all subsampled treatment and control BAM files
    treatment_files=$(ls subsampled_${subsample_val}_GATA*.bam | tr '\n' ' ')
    control_files=$(ls subsampled_${subsample_val}_IgG*.bam | tr '\n' ' ')

    macs3 callpeak --call-summits -t $treatment_files -c $control_files -n GATA_ChIP_subsample_${subsample_val} -g hs -q 0.01 --keep-dup all -f BAMPE --nomodel --tempdir temp_macs

    # Clean up intermediate subsampled files if needed
    rm subsampled_${subsample_val}_*.bam
done
#module load R/4.1.2 
#R 
count_peaks <- function(file_path) {
  peak_count <- system(paste("wc -l", file_path), intern = TRUE)
  return(as.numeric(strsplit(peak_count, " ")[[1]][1]))
}

# Find all summit.bed files
file_paths <- Sys.glob("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Saturation_curve/*summits*.bed")

# Initialize an empty data frame
data <- data.frame(Read_Depth = numeric(0), Peak_Number = numeric(0))

# Read data and count peaks for each file
for (i in seq_along(file_paths)) {
  peak_number <- count_peaks(file_paths[i])
  data <- rbind(data, data.frame(Read_Depth = c(100, seq(10, 90, 10))[i], Peak_Number = peak_number))
}

# Plot the saturation curve with lattice
library(lattice)
pdf('240103_saturation_curve.pdf',width=8,height=5)
print(xyplot(Peak_Number ~ Read_Depth, data = data, type = c("p", "smooth"), col = "blue"))
dev.off()


# Fit asymptotic regression
asymptotic_model <- nls(Peak_Number ~ asymptotic_formula(Read_Depth, a, b, c), 
                        data = data, 
                        start = list(a = mean(data$Peak_Number), b = 0.1, c = 10))
# Define the asymptotic formula (replace this with the formula you want to fit)
asymptotic_formula <- function(x, a, b, c) {
  a * (1 - exp(-b * x)) + c
}

# Plot the saturation curve with lattice and add fitted curve
pdf('240103_saturation_curve2.pdf',width=8,height=5)
print(xyplot(Peak_Number ~ Read_Depth, 
             data = data, 
             type = c("p", "smooth"), 
             col = "blue",
             xlim=c(0, 300),
             ylim=c(0, 150000),
             xlab="percent read depth (%)",
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)
              panel.curve(asymptotic_formula(x, coef(asymptotic_model)["a"], coef(asymptotic_model)["b"], coef(asymptotic_model)["c"]), col = "red", add = TRUE)
                         })
)
dev.off()


png('240103_saturation_curve2.png')
print(xyplot(Peak_Number ~ Read_Depth, 
             data = data, 
             type = c("p", "smooth"), 
             col = "blue",
             xlim=c(0, 300),
             ylim=c(0, 150000),
             xlab="percent read depth (%)",
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)
              panel.curve(asymptotic_formula(x, coef(asymptotic_model)["a"], coef(asymptotic_model)["b"], coef(asymptotic_model)["c"]), col = "red", add = TRUE)
                         })
)
dev.off()
saturation curve for GATA3 ChIP-seq peaks

Figure 7: saturation curve for GATA3 ChIP-seq peaks

  •   Peak sensitivity and specificity using GATA3-depleted cells and untreated cells. Do sensitive peak sets have a sequence-binding element?  \

4.2 GATA3 structure

  •   Homology of two GATA3 zinc fingers -N-zinc sequence, C-zinc sequences, align them. \

make sure the full structure of zinc finger; and the specificity of each zinc finger binding to the specific sequences.
Sequence Retrieval from Uniprot P23771 · GATA3_HUMAN
zinc finger 1: 263-287: CVNCGATSTPLWRRDGTGHYLCNAC[GLY] ; zinc finger 2: 317-341: CANCQTTTTTLWRRNANGDPVCNAC[GLY]
NCBI BLAST Alignment:

NCBI blast for the two zinc finger

Figure 8: NCBI blast for the two zinc finger

The two zinc finger of GATA3 have 60% identity.

  •   What are the characteristics of the two activation domains? \
  •   What are the characteristics of GATA3 zinc fingers? \

4.3 Software to search for variable spacing motifs.

4.4 Functional check:

  •   Given purified zinc fingers and a DNA template, can we observe different spacing/orientations of binding patterns? (cryoEM or in silicon alphafold to find protein-DNA binding) \